R Advanced Course 2025

Differential Gene Expression

Ulrike Göbel

Bioinformatics Core Facility CECAD

2025-11-20

Slides & Code

  • [f] Full screen
  • [o] Slide Overview
  • [c] Notes
  • [h] help

git repo

R-Basic


Clone repo

git clone https://github.com/CECADBioinformaticsCoreFacility/Intermediate_R_Course_2025.git


Slides Directly

https://cecadbioinformaticscorefacility.github.io/Advanced_R_Course_2025/

Session 2 + 4:: Differential Gene Expression

The Practice Dataset

The Practice Dataset

 

data_dir <- "." ## the current folder
matrix_file <- "GSE234563_Matrix.txt.gz"

# download it ..
#download.file(
#  url="https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE234563&format=file&file=GSE234563%5FMatrix%2Etxt%2Egz",
#  destfile=matrix_file
#)
# .. read it:
M <- read.table(fs::path(data_dir,
                         matrix_file),
                header = TRUE,
                sep = "\t",
                quote = ""
                )
# .. this is a huge matrix!
dim(M)
[1] 40946   144
# .. display the first 6 rows (output next page):
dt <- DT::datatable(M |> head())
dt

The Practice Dataset

Count Matrix Preprocessing

 

First of all, we convert the column names to all-lowercase, to avoid case issues:

colnames(M) <- tolower(colnames(M))

 

Then we discard genes with a total count < 10, as was done in the paper:

min_count <- 10
M <- M[rowSums(M) >= min_count,]
#M <- M[(rowSums(M) >= min_count) & 
#      apply(M,1,function(x) !any(x >= 10^3)) ,]

nrow(M)
[1] 32904

 

Finally we map matrix entries with fractional digits to the next higher integer value 
(the differential expression analysis expects integer counts):

M <- ceiling(M)

Introducing the DESeq2 Package

  • A Bioconductor package for Differential Gene Expression (DGE) analysis
  • Based on sophisticated mathematical theory (more on that later by Ali)
  • But its high-level functions allow to run an entire DGE analysis in three steps!
  • See browseVignettes("DESeq2")
  • Read the original paper
library(DESeq2)

browseVignettes("DESeq2")

Setting Up the Input for DESeq2

Setting Up the Input for DESeq2

Setting Up the Input for DESeq2

We do not have the colData – but it is actually easy to create this matrix starting from the count matrix column names:

# Step 1: split the count matrix column names at "_"
#         (result: a list of vectors, where each vector holds the fragments extracted from one name)
colData <- strsplit(colnames(M),"_") 

head(colData, n=2) 
[[1]]
[1] "old"       "female"    "creer"     "intestine" "1"        

[[2]]
[1] "old"       "female"    "creer"     "intestine" "2"        

 

# Step 2: make a matrix, in which each vector from Step 1 is a row
colData <- Reduce(rbind,colData)

head(colData, n=2) 
     [,1]  [,2]     [,3]    [,4]        [,5]
init "old" "female" "creer" "intestine" "1" 
     "old" "female" "creer" "intestine" "2" 

Setting Up the Input for DESeq2

# Step 3: We know what the row and column names should be, so let's set them:
dimnames(colData) <- list(colnames(M),
c("age","sex","genotype","tissue","replicate")
                          )
head(colData, n=2) 
                             age   sex      genotype tissue      replicate
old_female_creer_intestine_1 "old" "female" "creer"  "intestine" "1"      
old_female_creer_intestine_2 "old" "female" "creer"  "intestine" "2"      

 

# Step 4: The DESeq2 colData input should be a data.frame:
colData <- as.data.frame(colData)

# Step 5: Add a group which lists, for each sample, the levels of
#         all factors in this sample, concatenated by "." :
colData$group <- 
  apply(colData,1,
        function(x) paste(x[c("age","sex","tissue","genotype")],
                          collapse="."))

options(width=150) ## increase output line length 
# Show a bit more of the final result:
head(colData,n=4) 
                             age    sex genotype    tissue replicate                      group
old_female_creer_intestine_1 old female    creer intestine         1 old.female.intestine.creer
old_female_creer_intestine_2 old female    creer intestine         2 old.female.intestine.creer
old_female_creer_intestine_3 old female    creer intestine         3 old.female.intestine.creer
old_female_creer_kidney_1    old female    creer    kidney         1    old.female.kidney.creer

A Bold Try: Model the Data With Genotype Only

The most simple question we can ask to our data matrix is:  
Is the nmrHas transgene effect strong enough to manifest irrespective of age, sex and tissue?
Starting from this question, we will dig deeper both into the dataset and into modeling.  

 

library(DESeq2)

dds_genotype <-
  # High level step 1: set up the object structure
    DESeqDataSetFromMatrix(countData = M,
                           colData = colData,
                           design= ~ 1 + genotype 
                           ) |>
  # High level step 2: "do the math" (--> Ali!)
  #                    size factors factors,
  #                    dispersion estimates,
  #                    statistical tests ...
    DESeq()

At this point we can have the available test results listed:

resultsNames(dds_genotype)
[1] "Intercept"                 "genotype_nmrhas2_vs_creer"

A Bold Try: Model the Data With Genotype Only

We want to see the genotype effect …

res_genotype <- 
    results(dds_genotype, 
            name="genotype_nmrhas2_vs_creer"
            )

… in a nice table (see next slide):

dt <- DT::datatable(
           res_genotype |> as.data.frame() |>
           dplyr::filter(!is.na(padj), padj <= 0.05) |>
           dplyr::arrange(padj),

           options=list(scrollY="500px")
     ) 

# convert numeric columns to scientific notation
dt <- dt |>
      DT::formatSignif(
           sapply(dt$x$data, is.numeric),
           digits=2
      )

A Bold Try: Model the Data With Genotype Only

Why So Few Significant Genes?

Why So Few Significant Genes?

I had on purpose omitted a step which should really be done in the beginning: Get to know the sources of variability in your count matrix!  

All known such sources should be either eliminated or stated in the design given to DESeq2. Otherwise the neglected factors will be treated as random variance by the model and this may mask a real effect.

Why So Few Significant Genes?

Principal Component Analysis (PCA) projects the transcriptome of each sample onto a point on a two-dimensional canvas, such that the among-sample variance is maximal.

It helps to answer questions like:

  • Do the samples cluster as your hypotheses suggest?
  • If not, does the cluster structure suggest hidden factors which can be identified and included in the model?
  • Or does poor clustering of expected groups suggest experimental issues?

Why So Few Significant Genes?

Heatmaps display the transcriptomes directly as colored matrix columns. Clustering the columns can identify groups of samples with similar global gene expression, like PCA.

Boxplots of the matrix columns visualize the distributional properties of the per-sample counts.

Preparation for PCA

The DESeq2 package provides a so-called "variance stabilizing transformation", which reduces the increase in variance with count which is natural to count data. We will use the function to transform the input to PCA and heatmap functions.

# The "blind" parameter tells the function NOT to use the input object's
# model formula for identifying "valid" (expected) variation. 
# We do this here because we have not yet found an optimal formula.
# NOTE that if we have a trusted formula and want to use vst() to 
# prepare an object for downstream analysis, then blind=FALSE should 
# be used.
vsd_blind <- vst(dds_genotype, 
                 blind=TRUE
                )
saveRDS(vsd_blind,file="vsd_blind.rds")

What Drives the Structure?

plotPCA2(vsd_blind,
         intgroup = "group",
         ntop=500, 
         PCs=c(x=1,y=2))

There is very clear structure in the unlabelled dataset …

plotPCA2(vsd_blind,
         intgroup=c("genotype"),
         color=genotype,
         ntop=500, 
         PCs=c(x=1,y=2), )

… but genotype is more or less unrelated to this global structure!

What about the other variables?

What Drives the Structure?

 

Tissue determines the high level clusters!

Heatmap of the 500 Most Highly Expressed Genes

 

selected <- 
  order(rowMeans(counts(dds_genotype,
                        normalized=TRUE)
                ),
        decreasing=TRUE)[1:500]

df <- (colData(dds_genotype) |>
       as.data.frame())[,c("age",
                           "sex",
                           "genotype",
                           "tissue")]

pheatmap(assay(vsd_blind)[selected,], 
         cluster_rows=FALSE, 
         show_rownames=FALSE,
         show_colnames=FALSE,
         cluster_cols=TRUE, 
         annotation_col=df)

  • With the exception of spleen and WAT, all tissues have strongly distinct expression patterns
  • Both levels of the genotype factor are compatible with any of these global patterns

Boxplots of the Per-Sample Count Distributions

We want to plot the raw and normalized count distributions of each sample side by side, using ggplot(). For this we have to “melt” the log10 transformed matrices:

df <-
    rbind(reshape2::melt(log10(counts(dds_genotype,normalized=FALSE)+1)) |>
          mutate(mode="raw"),

          reshape2::melt(log10(counts(dds_genotype,normalized=TRUE)+1)) |>
          mutate(mode="normalized")
          ) |>
  # rename to make variable names more meaningful:
    dplyr::rename(gene=Var1, sample=Var2, count=value) |>
  
  # remove the trailing sample ID to create a common ID for all replicates
  # of a given factor level combination:
    mutate(group=sub("_\\d+$","",sample)) |>
  
  # ggplot needs to know the order of sub-plots ("raw" should come first)
    mutate(mode = factor(mode,
                         levels=c("raw","normalized")
                         )
         )

Boxplots of the Per-Sample Count Distributions

Then plot it:

ggplot(df,
       aes(y=sample,x=count,fill=group)
       ) +
    geom_boxplot(show.legend=FALSE) +
    facet_wrap(vars(mode),ncol=2) +
    theme_classic() +
    theme(axis.text=element_text(size=2)) 

Boxplots of the Per-Sample Count Distributions

Boxplots of the Per-Sample Count Distributions

  • While raw counts may differ in shape and location between samples, these differences should go away after a successful DESeq2 “median of ratios” normalization.

  • Failure to normalize indicates an issue!.  

  • Here, the reason is very likely the fact that the samples come from different tissues, with very different gene expression signatures. DESeq2’s approach to use the geometric expression mean over all samples of each gene as a pseudo-reference count may not be justified in this case.

Boxplots of the Per-Sample Count Distributions

To get an impression of the source of the distribution differences, we can look at the ordered median counts over the replicates of each level configuration:

cnt <- counts(dds_genotype,normalized=TRUE)
medians_log10 <- log10(colMedians(cnt |> as.matrix())+1)

tapply(medians_log10, # what to group
       sub("_\\d+$","",
           names(medians_log10)
          ), # by what to group: sample name  without replicate ID
       mean  # function to apply to grouped values
       ) |> 
  sort(decreasing=TRUE) # sort the result

## [see next page]

Boxplots of the Per-Sample Count Distributions

      old_male_creer_intestine     young_male_creer_intestine   young_male_nmrhas2_intestine   old_female_nmrhas2_intestine 
                     1.5478917                      1.5427134                      1.5222127                      1.5107087 
    old_male_nmrhas2_intestine   young_female_creer_intestine young_female_nmrhas2_intestine    young_female_nmrhas2_spleen 
                     1.4955060                      1.4806026                      1.4768777                      1.4417853 
    old_female_creer_intestine      young_female_creer_spleen      old_female_nmrhas2_spleen        old_female_creer_spleen 
                     1.3831509                      1.3737798                      1.3423507                      1.3355697 
        old_female_nmrhas2_wat             old_male_creer_wat           old_male_nmrhas2_wat       young_female_nmrhas2_wat 
                     1.3335537                      1.3148506                      1.3148082                      1.3083296 
          old_female_creer_wat    young_female_nmrhas2_kidney         young_female_creer_wat      old_female_nmrhas2_kidney 
                     1.2930518                      1.2785407                      1.2506189                      1.2443773 
       old_female_creer_kidney      old_female_nmrhas2_muscle      young_female_creer_kidney        old_female_creer_muscle 
                     1.2396100                      1.2381502                      1.2302532                      1.2239739 
        young_male_nmrhas2_wat          old_male_creer_muscle          old_male_creer_kidney    young_female_nmrhas2_muscle 
                     1.1557166                      1.1538079                      1.1480708                      1.1315603 
          young_male_creer_wat      young_female_creer_muscle        old_male_nmrhas2_kidney         old_female_creer_liver 
                     1.1265339                      1.1053778                      1.1016026                      1.0991615 
       old_male_nmrhas2_muscle        old_male_nmrhas2_spleen       old_female_nmrhas2_liver     young_female_nmrhas2_liver 
                     1.0725906                      1.0714532                      1.0663980                      1.0524654 
         old_male_creer_spleen       young_female_creer_liver        young_male_creer_kidney      young_male_nmrhas2_kidney 
                     1.0484869                      1.0440519                      1.0381674                      1.0274570 
       young_male_creer_spleen      young_male_nmrhas2_spleen        young_male_creer_muscle           old_male_creer_liver 
                     1.0251041                      1.0173429                      1.0163864                      0.9905823 
        old_male_nmrhas2_liver      young_male_nmrhas2_muscle         young_male_creer_liver       young_male_nmrhas2_liver 
                     0.9406456                      0.9160227                      0.8605382                      0.5546036 
  • intestine has very consistently the highest medians,
  • liver tends to be at the low end

Boxplots of the Per-Sample Count Distributions

  • It may be advisable to split the data matrix at least into “intestine” and the remaining samples
  • I am not going to do this in this presentation -- YOU will do it tomorrow in the project session :)
  • Here I will continue with the full matrix, and see whether a better model helps to recover more of the genotype effect

But Before We Go On …

… Ali On the Mathematics Behind DESeq2!

Understanding Model Formulae

 

Model formulae like ~ 1 + genotype provide a high level language

  • for the experimenter to conceptually express a hypothesis about his or her data

  • for modeling software to know the structure of the tests to compute  

  • Independent of the actual mathematics of the tests !

Understanding Model Formulae

Understanding Model Formulae

Understanding Model Formulae

Understanding Model Formulae

Understanding Model Formulae

Understanding Model Formulae

Understanding Model Formulae

Understanding Model Formulae

Understanding Model Formulae

Understanding Model Formulae – Multiple Factors

Understanding Model Formulae – Multiple Factors

Understanding Model Formulae – Multiple Factors

Understanding Model Formulae – Multiple Factors

Understanding Model Formulae – Multiple Factors

An Additive Model of Gene Expression Including All Factors

Including all factors in the model helps to properly capture their effects on the observed read counts. This in turn may expose more of a significant, although small, effects of the genotype factor.  

 

library(DESeq2)

dds_additive <-
  # High level step 1: set up the object structure
    DESeqDataSetFromMatrix(countData = M,
                           colData = colData,
                           design= ~ age + sex + tissue + genotype 
                           ) |>
    DESeq()

These are the available test results:

resultsNames(dds_additive)
[1] "Intercept"                  "age_young_vs_old"           "sex_male_vs_female"         "tissue_kidney_vs_intestine" "tissue_liver_vs_intestine" 
[6] "tissue_muscle_vs_intestine" "tissue_spleen_vs_intestine" "tissue_wat_vs_intestine"    "genotype_nmrhas2_vs_creer" 

Let’s Have a Look at the Genotype Effect in this Model

res_additive <- 
    results(dds_additive, name="genotype_nmrhas2_vs_creer")
dt <- DT::datatable(
           res_additive |> as.data.frame() |>
           dplyr::filter(!is.na(padj), padj <= 0.05) |>
           dplyr::arrange(padj),

           options=list(scrollY="50000px",
                        lengthMenu = list(c(10, -1), 
                                          c('10','All'))
                        )
     ) 

# convert numeric columns to scientific notation
dt <- dt |>
      DT::formatSignif(sapply(dt$x$data, is.numeric),
                       digits=2
      )

Let’s Have a Look at the Genotype Effect in this Model

Yet Another Alternative: Group Contrasts

Is there a way to compare groups of samples which differ only in the level of genotype – in the context of the full dataset, taking the levels of all experimental factors into account?

Breaking this sentence up into its components, we observe

  • we need new factors in colData which describe co-occurring levels of different factors in the same sample
  • we need a model in which we can compare levels of such compound factors

Yet Another Alternative: Group Contrasts

First we define another compound factor in colData:

options(width=150) ## increase output line length 

## collapse (but don't remove!) columns age, sex, and tissue
## a new column "background":
colData$background <- 
  apply(colData,1,
        function(x) paste(x[c("age","sex","tissue")],collapse="."))
  
head(colData,n=1)
                             age    sex genotype    tissue replicate                      group           background
old_female_creer_intestine_1 old female    creer intestine         1 old.female.intestine.creer old.female.intestine

Yet Another Alternative: Group Contrasts

Next we express our desired pairwise comparisons in a table:

contrasts <-
  data.frame(group= "group",
             A    = paste(colData[colData$genotype=="nmrhas2", "background"],
                          colData[colData$genotype=="nmrhas2", "genotype"],
                          sep="."
                         ),
             B=     paste(colData[colData$genotype=="creer", "background"],
                          colData[colData$genotype=="creer", "genotype"],
                          sep="."
                         )
  ) |> unique() # there is initially one copy per replicate!
saveRDS(contrasts,file="qmd_contrasts.rds")
head(contrasts,n=6)
   group                            A                          B
1  group old.female.intestine.nmrhas2 old.female.intestine.creer
4  group    old.female.kidney.nmrhas2    old.female.kidney.creer
7  group     old.female.liver.nmrhas2     old.female.liver.creer
10 group    old.female.muscle.nmrhas2    old.female.muscle.creer
13 group    old.female.spleen.nmrhas2    old.female.spleen.creer
16 group       old.female.wat.nmrhas2       old.female.wat.creer
nrow(contrasts)
[1] 24

Yet Another Alternative: Group Contrasts

 

We arrive at a test for these contrasts in two steps:

  • Run DESeq() with a 0 + group model. P-values from this model refer to whether or not the mean read count at a given “group” level is significantly different from zero. Importantly the calculations take the count distribution over all levels into account.

  • Then when passing the DESeq() result to the results() function, we can use its "contrast" argument to specifically compare a pair of “group” levels ` to be compared. Although each such comparison involves only two levels, it will inherit information on the full distribution from the DESeq() step.

Yet Another Alternative: Group Contrasts

Step 1:

dds_group <-
  DESeqDataSetFromMatrix(countData = M,
                         colData = colData,
                         design= ~ 0 + group) |>
  DESeq()

Step 2:

# a matrix allows to directly extract rows as vectors
# (unlike a data.frame):
m <- as.matrix(contrasts)

res_contrasts <- list()
for(i in 1:nrow(m)) {

    res_contrasts[[i]] <- results(dds_group,
                                  contrast=m[i,]
                          ) 
}
# the second name in each contrast pair without the trailing ".creer"
# is the background -- use as name:
names(res_contrasts)  <- sub("\\.creer$","",contrasts[,"B"])

Yet Another Alternative: Group Contrasts

res_contrasts[["old.female.intestine"]] |> as.data.frame() |>
  filter(padj <= 0.05) |> head(n=10)
         baseMean log2FoldChange     lfcSE      stat       pvalue         padj
Spry3     26.0529     -3.3636880 0.5851900 -5.748027 9.029089e-09 4.808012e-07
Tmlhe    498.9822     -1.4208992 0.2037621 -6.973325 3.095366e-12 4.077209e-10
Trub1    384.3361      0.3227119 0.1189521  2.712958 6.668559e-03 3.275283e-02
Ccdc186 1564.6499     -0.5470574 0.1727598 -3.166578 1.542439e-03 1.032009e-02
Casp7   1413.4460     -0.7996830 0.2129014 -3.756118 1.725691e-04 1.803400e-03
Acsl5   9971.4795     -1.4114589 0.3140923 -4.493771 6.997283e-06 1.316687e-04
Gm50331  276.9841      0.9263905 0.3173082  2.919529 3.505612e-03 1.987312e-02
Sfr1    2111.7307     -0.4495477 0.1319809 -3.406157 6.588422e-04 5.189392e-03
As3mt   1408.3211     -0.5860089 0.1836627 -3.190681 1.419381e-03 9.658478e-03
Trim8    952.4466      0.5123665 0.1586725  3.229082 1.241884e-03 8.631979e-03

Yet Another Alternative: Group Contrasts

 

The old.female.intestine background has more (3692) up-regulated genotype-responsive genes than any other background. In addition it has the second highest number (1922) of down-regulated genotype-responsive genes. We will have a quick look at enriched Gene Ontology terms in these two sets of genes, to see whether they contain interesting topics. [Tomorrow we will talk in depth about Functional Enrichment.]

Yet Another Alternative: Group Contrasts

 

require("org.Mm.eg.db")

n <- "old.female.intestine"
genes <- rownames(res_contrasts[[n]])

# extract a named vector of adjusted p-values:
x <- res_contrasts[[n]] |> pull(padj) |>  
     setNames(genes)
# same for log2FoldChange:
lfc <- res_contrasts[[n]] |> pull(log2FoldChange) |>  
       setNames(genes)

L <- names(x[!is.na(x)])
l <- names(x[!is.na(x) & (x <= 0.05) & (lfc >= 1)])

clusterProfiler::enrichGO(gene = l,
                          universe = L,
                          OrgDb = org.Mm.eg.db,
                          keyType="SYMBOL",
                          ont="BP"
                  ) |> clusterProfiler::dotplot()

Yet Another Alternative: Group Contrasts

 

L <- names(x[!is.na(x)])
l <- names(x[!is.na(x) & (x <= 0.05) & (lfc <= -1)])

clusterProfiler::enrichGO(gene = l,
                          universe = L,
                          OrgDb = org.Mm.eg.db,
                          keyType="SYMBOL",
                          ont="BP"
                  ) |> clusterProfiler::dotplot()

Yet Another Alternative: Group Contrasts

 

n <- "old.female.liver"
genes <- rownames(res_contrasts[[n]])

# extract a named vector of adjusted p-values:
x <- res_contrasts[[n]] |> pull(padj) |>  
     setNames(genes)
# same for log2FoldChange:
lfc <- res_contrasts[[n]] |> pull(log2FoldChange) |>  
       setNames(genes)

L <- names(x[!is.na(x)])
l <- names(x[!is.na(x) & (x <= 0.05) & (lfc >= 1)])

clusterProfiler::enrichGO(gene = l,
                          universe = L,
                          OrgDb = org.Mm.eg.db,
                          keyType="SYMBOL",
                          ont="BP"
                  ) |> clusterProfiler::dotplot()

Yet Another Alternative: Group Contrasts

 

L <- names(x[!is.na(x)])
l <- names(x[!is.na(x) & (x <= 0.05) & (lfc <= -1)])

clusterProfiler::enrichGO(gene = l,
                          universe = L,
                          OrgDb = org.Mm.eg.db,
                          keyType="SYMBOL",
                          ont="BP"
                  ) |> clusterProfiler::dotplot()